Abstract

We consider the problem of approximately reconstructing a partially-observed, approximately
low-rank matrix. This problem has received much attention lately, mostly using the trace-norm
as a surrogate to the rank. Here we study low-rank matrix reconstruction using both the
trace-norm, as well as the less-studied max-norm, and present reconstruction guarantees based
on existing analysis on the Rademacher complexity of the unit balls of these norms. We show
how these are superior in several ways to recently published guarantees based on specialized
analysis.
1 Introduction

We consider the problem of (approximately) reconstructing an (approximately) low-rank matrix
based on observing a random subset of entries. That is, we observe $s$ randomly chosen entries of
an unknown matrix $Y \in \mathbb{R}^{n \times m}$, where we assume either $Y$ is of rank at most $r$,
or there exists $X \in \mathbb{R}^{n \times m}$ of rank at most $r$ that is close to $Y$. Based on
these $s$ observations, we would like to construct a matrix $\hat{X}$ that is as close as possible to $Y$.

There has been much interest recently in computationally efficient methods for reconstructing a
partially-observed, possibly noisy, low-rank matrix, and on accompanying guarantees on the quality
of the reconstruction and the required number of observations. Since directly searching for a low-rank
matrix minimizing the empirical reconstruction error is NP-hard (Chistov and Grigoriev, 1984), most
work has focused on using the trace-norm (a.k.a. nuclear norm, or Schatten-1-norm) as a surrogate
for the rank. The trace-norm of a matrix is the sum (i.e. $\ell_1$-norm) of its singular values, and thus
relaxing the rank (i.e. the number of non-zero singular values) to the trace-norm is akin to relaxing
the sparsity of a vector to its $\ell_1$-norm, as is frequently done in compressed sensing. The analysis
of the quality of reconstruction has also been largely driven by ideas coming from compressed
sensing, typically studying the optimality conditions of the empirical optimization problem, and
often requiring various "incoherence"-type assumptions on the underlying low-rank matrix.

In this paper we provide simple guarantees on approximate low-rank matrix reconstruction using
a different surrogate regularizer: the $\gamma_2{:}\ell_1 \to \ell_\infty$ norm, which we refer to simply
as the "max-norm". This regularizer was first suggested by Srebro et al. (2005), though it has not
received much attention since. Here we show how this regularizer can yield guarantees that are
superior in some ways to recent state-of-the-art. In particular, we show that when the entries are
uniformly bounded, i.e. $|X|_\infty = O(1)$ (this corresponds to the "no spikiness" assumption of
Negahban and Wainwright (2010), and is also assumed by Koltchinskii et al. (2010) and in the
approximate reconstruction guarantee of Keshavan et al. (2010)), then the max-norm regularized
predictor requires a sample size of

$$ s = O\left( \frac{r(n+m)}{\epsilon} \cdot \frac{\epsilon + \sigma^2}{\epsilon} \cdot \log^3(1/\epsilon) \right) \qquad (1) $$

to achieve mean-squared reconstruction error $\frac{1}{nm}|\hat{X} - Y|_2^2 \le \sigma^2 + \epsilon$, where
$\sigma^2$ is the mean-squared-error of the best rank-$r$ approximation of $Y$; that is,
$\sigma^2 = \frac{1}{nm}|Y - X|_2^2$ where $X$ is the rank-$r$ approximation. When $Y$ is exactly low-rank
(the noiseless case), $\sigma^2 = 0$ and the sample complexity is
$O\left( \frac{r(n+m)}{\epsilon} \cdot \log^3(1/\epsilon) \right)$. Compared to the three recent similar
bounds mentioned above, this guarantee avoids the extra logarithmic dependence on the dimensionality,
as well as the assumption of independent noise, but has a slightly worse dependence on $\epsilon$. We
emphasize that we do not make any assumptions about the noise, nor about incoherence properties of
the underlying low-rank matrix $X$.

We also provide a guarantee on the mean-absolute-error of the reconstruction, and discuss
guarantees for reconstruction using the trace-norm as a surrogate. Using the trace-norm allows us to
provide mean-absolute-error guarantees also for matrices where the magnitudes are not uniformly
bounded (i.e. "spiky" matrices). We further show that a spikiness assumption is necessary for
squared-error approximate reconstruction of low-rank matrices, regardless of the estimator used.

Instead of focusing on optimality conditions as in previous work, our guarantees follow from
generic generalization guarantees based on the Rademacher complexity, and an analysis of the
Rademacher complexity of the max-norm and trace-norm balls conducted by Srebro and Shraibman
(2005). To obtain the desired low rank reconstruction guarantees, we combine these with bounds
on the max-norm and trace-norm in terms of the rank. The point we make here is that these fairly
simple arguments, mostly based on the work of Srebro and Shraibman (2005), are enough to obtain
guarantees that are in many ways better and more general than those presented in recent years.

Notation. We use $|M|_p$ to denote the elementwise $\ell_p$ norm of a matrix $M$; in particular,
$|M|_2$ is the Frobenius norm, and $|M|_\infty = \max_{ij} |M_{ij}|$. We discuss $n \times m$ matrices,
and without loss of generality always assume $n \ge m$.

2 The Max-Norm and Trace-Norm

We will consider the following two matrix norms, which are both surrogates for the rank:

Definition 1. The trace-norm of a matrix $X \in \mathbb{R}^{n \times m}$ is given by:

$$ \|X\|_{tr} = \sum (\text{singular values of } X) = \min_{X = UV^\top} |U|_2 \, |V|_2 . $$

Definition 2. The max-norm of a matrix $X \in \mathbb{R}^{n \times m}$ is given by:

$$ \|X\|_{\max} = \min_{X = UV^\top} \left( \max_i |U_{(i)}|_2 \right) \left( \max_j |V_{(j)}|_2 \right) , $$

where $U_{(i)}$ and $V_{(j)}$ denote the $i$th row of $U$ and the $j$th row of $V$, respectively.
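
As a small numerical illustration of the two definitions (not part of the original analysis), the sketch below computes the trace-norm from the singular values and evaluates both factorization objectives on one particular factorization, the balanced SVD factorization. The function names are hypothetical. For this factorization the product of Frobenius norms attains the trace-norm, while the product of maximal row norms is only an upper bound on the max-norm, since the true max-norm requires minimizing over all factorizations.

```python
import numpy as np

def trace_norm(X):
    """Sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()

def balanced_factors(X):
    """Return U, V with X = U @ V.T, built from the SVD X = P diag(d) Q^T
    as U = P sqrt(diag(d)) and V = Q sqrt(diag(d))."""
    P, d, Qt = np.linalg.svd(X, full_matrices=False)
    root = np.sqrt(d)
    return P * root, Qt.T * root

def max_norm_upper_bound(U, V):
    """Largest row norm of U times largest row norm of V.
    This upper-bounds the max-norm of U @ V.T for this one factorization."""
    return np.linalg.norm(U, axis=1).max() * np.linalg.norm(V, axis=1).max()

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))  # rank-2 example
U, V = balanced_factors(X)

frobenius_product = np.linalg.norm(U) * np.linalg.norm(V)  # equals the trace-norm
row_norm_product = max_norm_upper_bound(U, V)              # at least max |X_ij|
```

Computing the max-norm exactly would require solving the semi-definite program; the bound above is only what a single factorization certifies.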

Both the trace-norm and the max-norm are semi-definite representable (Fazel et al., 2002; Srebro et al.,
2005). Consequently, optimization problems involving a constraint on the trace-norm or max-norm,
a linear or quadratic objective, and possibly additional linear constraints, are solvable using
semi-definite programming. We will consider estimators which are solutions to such problems.

Srebro and Shraibman (2005) and later Sherstov (2007) studied the max-norm and trace-norm
as surrogates for the rank in a classification setting, where one is only concerned with the signs of the 
underlying matrix. They showed that a sign matrix might be realizable with low rank, but realizing it 
with unit margin might require exponentially high max-norm or trace-norm. Based on this analysis, 
they argued that the max-norm and trace-norm cannot be used to obtain reconstruction guarantees 
for sign matrices of low rank matrices. 

Here, we show that in a regression setting, the situation is quite different, and the max-norm 
and trace-norm are good convex surrogates for the rank. The specific relationship between these 
surrogates and the rank is determined by how we control the scale of the matrix X (i.e. the magnitude 
of its entries). This will be made explicit in the next section, but for now we state the bounds on 
the trace-norm and max-norm in terms of the rank which we will leverage in Section 3.

By bounding the $\ell_1$ norm of the singular values (i.e. the trace-norm) by their $\ell_2$ norm (i.e. the
Frobenius norm) and the number of non-zero values (the rank), we obtain the following relationship
between the trace-norm and Frobenius norm:

$$ |X|_2 \le \|X\|_{tr} \le \sqrt{\operatorname{rank}(X)} \cdot |X|_2 . \qquad (2) $$

Interpreting the Frobenius norm as specifying the average entry magnitude, $\frac{1}{nm}|X|_2^2$, we can view
the above as upper bounding the trace-norm with the square root of the rank, when the average
entry magnitude is fixed.
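
The two-sided relationship between the trace-norm and the Frobenius norm is easy to check numerically. The following sketch does so on a random rank-2 matrix; the dimensions, seed, and tolerance are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 8, 5, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank r

singular_values = np.linalg.svd(X, compute_uv=False)
frobenius = np.linalg.norm(X)          # l2 norm of the singular values
trace = singular_values.sum()          # l1 norm of the singular values
rank = np.linalg.matrix_rank(X)

# Frobenius <= trace-norm <= sqrt(rank) * Frobenius, up to rounding error.
lower_ok = frobenius <= trace + 1e-9
upper_ok = trace <= np.sqrt(rank) * frobenius + 1e-9
```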

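An analogous bound for the max-norm, substituting the $\ell_\infty$ norm (maximal entry magnitude) for
the Frobenius norm (average entry magnitude), can be obtained as follows:

Lemma 1. For any $X \in \mathbb{R}^{n \times m}$, $|X|_\infty \le \|X\|_{\max} \le \sqrt{\operatorname{rank}(X)} \cdot |X|_\infty$.

Proof. Consider the minimizing factorization $X = UV^\top$ and let $X_{ij}$ be the largest magnitude entry
in $X$; then $\|X\|_{\max} \ge |U_{(i)}|_2 \cdot |V_{(j)}|_2 \ge |X_{ij}| = |X|_\infty$.

To obtain the upper bound we first write the max-norm as (Lee et al., 2008):

$$ \|X\|_{\max} = \sup_{p,q} \|\operatorname{diag}(p)\, X \operatorname{diag}(q)\|_{tr} , \qquad (3) $$

where the supremum is over nonnegative unit vectors $p, q$. We can now continue using (2):

$$ \|X\|_{\max} \le \sup_{p,q} \sqrt{\operatorname{rank}(\operatorname{diag}(p) X \operatorname{diag}(q))} \cdot |\operatorname{diag}(p) X \operatorname{diag}(q)|_2
\le \sup_{p,q} \sqrt{\operatorname{rank}(X)} \cdot \sqrt{\sum_{ij} p_i^2 X_{ij}^2 q_j^2} = \sqrt{\operatorname{rank}(X)} \cdot |X|_\infty . \qquad \square $$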



3 Reconstruction Guarantees 

The theorems below provide reconstruction guarantees, first under a mean-absolute-error
reconstruction measure (Theorem 1) and then under a mean-squared-error reconstruction measure
(Theorem 2). Since the guarantees are for approximate reconstruction, we must impose some notion
of scale. In other words, we can think of measuring the error relative to the scale of the data: if $Y$ is
multiplied by some constant, then obviously the reconstruction error would also be multiplied by this
constant. In the theorems below we refer to two notions of scale: the average squared magnitude of
matrix entries, i.e. $\frac{1}{nm}|Y|_2^2$, and the maximal magnitude of matrix entries, i.e. $|Y|_\infty$.
For simplicity and without loss of generality, the results are stated for unit scale.

An issue to take note of is whether the s observed entries of Y are chosen with or without 
replacement, i.e. whether we choose a set S of entries uniformly at random over all sets of exactly 
s entries (no replacements), or whether we make s independent uniform choices of entries, possibly 
observing the same entry twice. Our results apply in both cases. 
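
The two sampling schemes described above can be sketched as follows. The helper name `sample_entries` is hypothetical; it returns the observed index pairs as rows of an array, which is a multiset when sampling with replacement.

```python
import numpy as np

def sample_entries(n, m, s, replace, rng):
    """Return an (s, 2) array of index pairs (i, j) drawn uniformly from
    an n-by-m matrix, either with or without replacement."""
    flat = rng.choice(n * m, size=s, replace=replace)
    return np.column_stack(np.unravel_index(flat, (n, m)))

rng = np.random.default_rng(2)
S_without = sample_entries(4, 3, 10, replace=False, rng=rng)  # 10 distinct entries
S_with = sample_entries(4, 3, 10, replace=True, rng=rng)      # repeats are possible
```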

Theorem 1. For any $M, Y \in \mathbb{R}^{n \times m}$ where $M$ is of rank at most $r$:

a. Entry magnitudes bounded on-average. Consider the estimator¹

$$ \hat{X}(S) = \arg\min_{\|X\|_{tr} \le \sqrt{rnm}} \sum_{(i,j) \in S} |Y_{ij} - X_{ij}| . $$

If $\frac{1}{nm}|M|_2^2 \le 1$ and $s \ge O\left( \frac{r(n+m)\log(n)}{\epsilon^2} \right)$, then in expectation
over a sample $S$ chosen either uniformly over sets of size $s$ (without replacements) or by choosing
$s$ entries uniformly and independently (with replacements):

$$ \frac{1}{nm} |Y - \hat{X}(S)|_1 \le \frac{1}{nm} |Y - M|_1 + \epsilon . $$

b. Entry magnitudes bounded uniformly. Consider the estimator

$$ \hat{X}(S) = \arg\min_{\|X\|_{\max} \le \sqrt{r}} \sum_{(i,j) \in S} |Y_{ij} - X_{ij}| . $$

If $|M|_\infty \le 1$ and $s \ge O\left( \frac{r(n+m)}{\epsilon^2} \right)$, then in expectation over a sample $S$
of size $s$ chosen either with or without replacements as above:

$$ \frac{1}{nm} |Y - \hat{X}(S)|_1 \le \frac{1}{nm} |Y - M|_1 + \epsilon . $$

Remark 1. The above results can also be shown to hold in high probability over the sample $S$, rather
than in expectation. Specifically, to ensure that the results of Theorem 1 hold with probability at least
$1 - n^{-\beta}$ (for sampling with replacement) or $1 - n^{-(\beta - 3)}$ (for sampling without replacement),
it is sufficient to change the sample size requirement to
$s \ge O\left( \frac{r(n+m)\log(n) + \beta\log(n)}{\epsilon^2} \right)$ (in the trace-norm case) or
$s \ge O\left( \frac{r(n+m) + \beta\log(n)}{\epsilon^2} \right)$ (in the max-norm case).

¹ If $S$ is chosen with replacements, it is a multiset, and the summation $\sum_{(i,j) \in S}$ should be
interpreted as summation with repetitions.
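Theorem 2. For any $Y = M + Z \in \mathbb{R}^{n \times m}$, where $|Z|_\infty \le \sqrt{\frac{r(n+m)}{\log n}}$
and $M$ is of rank at most $r$ with $|M|_\infty \le 1$, denote $\sigma^2 = \frac{1}{nm}|Z|_2^2$. Consider the
estimator

$$ \hat{X}(S) = \arg\min_{\|X\|_{\max} \le \sqrt{r}} \sum_{(i,j) \in S} (Y_{ij} - X_{ij})^2 . \qquad (4) $$

If $s \ge O\left( \frac{r(n+m)}{\epsilon} \cdot \frac{\epsilon + \sigma^2}{\epsilon} \cdot \left( \log^3(r/\epsilon) + \beta \right) \right)$,
then, with probability at least $1 - n^{-\beta}$ over a sample $S$ of size $s$ chosen with replacement, or
with probability at least $1 - n^{-(\beta - 3)}$ over a sample $S$ of size $s$ chosen without replacement,

$$ \frac{1}{nm} |Y - \hat{X}(S)|_2^2 \le \sigma^2 + \epsilon . \qquad (5) $$

If we instead use the estimator:

$$ \hat{X}(S) = \arg\min_{\|X\|_{\max} \le \sqrt{r},\ |X|_\infty \le 1} \sum_{(i,j) \in S} (Y_{ij} - X_{ij})^2 , \qquad (6) $$

then we obtain (5) when
$s \ge O\left( \frac{r(n+m)}{\epsilon} \cdot \frac{\epsilon + \sigma^2}{\epsilon} \cdot \left( \log^3(1/\epsilon) + \beta \right) \right)$.

The estimator (6) is SDP-representable, though potentially more cumbersome.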
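
The max-norm constrained least-squares estimators above are posed as semi-definite programs. Purely as an illustration, the sketch below swaps in a different, non-convex technique: projected gradient descent over a factorization $X = UV^\top$, clipping row norms so that the max-norm constraint holds by Definition 2. This is not the paper's method and carries no convergence guarantee; the function names, the factor width `k`, the step size, and the iteration count are all assumptions made for the example.

```python
import numpy as np

def clip_rows(W, radius):
    """Project each row of W onto the Euclidean ball of the given radius."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    return W * scale

def fit_max_norm(Y, rows, cols, bound, k=5, steps=500, lr=0.05, seed=0):
    """Heuristic for: minimize the sum over observed (i, j) of
    (Y_ij - (U V^T)_ij)^2 with every row of U and V of norm <= sqrt(bound),
    which certifies that the max-norm of U V^T is at most `bound`."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    radius = np.sqrt(bound)
    U = clip_rows(0.1 * rng.standard_normal((n, k)), radius)
    V = clip_rows(0.1 * rng.standard_normal((m, k)), radius)
    for _ in range(steps):
        resid = (U[rows] * V[cols]).sum(axis=1) - Y[rows, cols]
        grad_U = np.zeros_like(U)
        grad_V = np.zeros_like(V)
        # np.add.at accumulates repeated indices, so a multiset S is handled.
        # The factor 2 of the squared-loss gradient is absorbed into lr.
        np.add.at(grad_U, rows, resid[:, None] * V[cols])
        np.add.at(grad_V, cols, resid[:, None] * U[rows])
        U = clip_rows(U - lr * grad_U, radius)
        V = clip_rows(V - lr * grad_V, radius)
    return U, V

# Toy usage: a noiseless rank-2 matrix rescaled so that max |M_ij| = 1.
rng = np.random.default_rng(0)
n, m, r = 20, 15, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
M /= np.abs(M).max()
flat = rng.choice(n * m, size=150, replace=True)  # sampling with replacement
rows, cols = np.unravel_index(flat, (n, m))
U, V = fit_max_norm(M, rows, cols, bound=np.sqrt(r))
X_hat = U @ V.T
```

The additional constraint $|X|_\infty \le 1$ of the second estimator is not enforced by this sketch.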

Remark 2. The requirement on the maximal magnitude of the error in Theorem 2,
$|Z|_\infty \le \sqrt{\frac{r(n+m)}{\log n}}$, is very generous, and easily holds with high probability for
sub-exponential noise. A stricter requirement, e.g. $O(\sqrt{r \log n})$, which still holds with high
probability for subgaussian noise, yields a guarantee with exponentially high probability
$1 - e^{-n/\log n}$ without a sample-complexity dependence on $\beta$.

Remark 3. A guarantee similar to Theorem 2 can be obtained if we can ensure $\|M\|_{\max} \le A$, for
some $A$, without requiring $|M|_\infty \le 1$. For
$\hat{X}(S) = \arg\min_{\|X\|_{\max} \le A} \sum_{(i,j) \in S} (Y_{ij} - X_{ij})^2$, we have (5)
with a sample of size

$$ s \ge O\left( \frac{A^2(n+m)}{\epsilon} \cdot \frac{\epsilon + \sigma^2}{\epsilon} \cdot \left( \log^3(A/\epsilon) + \beta \right) \right) . $$

In Section 4.2.3 we will see how certain incoherence assumptions used in previous bounds yield
a bound on $\|M\|_{\max}$, and compare the max-norm based reconstruction guarantee to the previously
published results.

In Theorems 1 and 2 we do not assume the noise, i.e. the entries of $Z = Y - M$, are independent
or zero-mean. In fact, we make no assumption on $Z$, other than the very generous upper bound
on $|Z|_\infty$ discussed above. When entries of $Z$ can be arbitrary, it is not possible to ensure
reconstruction of $M$ (e.g. we can set things up so $Y$ actually has lower rank than $M$, and so it is
impossible to identify $M$). Consequently, in Theorems 1 and 2 we instead bound the excess error in
predicting $Y$ itself. If entries of $Z$ are independent and zero-mean, then we may give the following
guarantee about reconstructing the underlying matrix $M$:

Theorem 3. For $(i,j) \in [n] \times [m]$, let $\mathcal{D}_{(i,j)}$ be any mean-zero distribution. Suppose that
the observed entries of $Y$ are given by $Y_{i_t j_t} = M_{i_t j_t} + Z_t$ for $t = 1, 2, \dots, s$, where
$(i_t, j_t) \sim \mathrm{Unif}([n] \times [m])$ and $Z_t \mid (i_t, j_t) \sim \mathcal{D}_{(i_t, j_t)}$ independently
for each $t$. That is, the noise is independent and zero-mean (though its distribution is allowed to
depend on the location of the observation), the sample is drawn with replacement, and if an entry of
the matrix is observed more than once, then the noise on the entry is drawn independently each time.



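Assume $|M|_\infty \le 1$, $\operatorname{rank}(M) \le r$, and $\sup_{t \in [s]} |Z_t| \le O\left(\sqrt{r \log n}\right)$
with high probability. Denote

$$ \sigma^2 = \frac{1}{nm} \sum_{(i,j)} \mathbb{E}_{Z \sim \mathcal{D}_{(i,j)}} \left( Z^2 \right) . $$

For the estimator given in Equation (4), with high probability over the sample $S$ of size
$s \ge O\left( \frac{r(n+m)}{\epsilon} \cdot \frac{\epsilon + \sigma^2}{\epsilon} \cdot \log^3(r/\epsilon) \right)$,

$$ \frac{1}{nm} |M - \hat{X}(S)|_2^2 \le \epsilon . \qquad (7) $$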
nm 
Alternatively, if $S$ is sampled uniformly without replacements, with the same assumptions and sample
size, and as long as s < ^^-^{nm) , we have ■:^\M ~ X[Sy\i^ < AKe. 

Remark 4. When sampling without replacement, we imposed both a lower bound and an upper
bound on the sample size. For these two bounds to be compatible (in an asymptotic sense) for a fixed
$K$, we need $m = \Omega(n^a)$ for some positive power $a$, and make $\epsilon$ arbitrarily small. Alternately,
we can set $K = O(\log n)$, ensuring the upper bound on $s$ always holds (since $s \le nm$ necessarily), yielding

^\M-XiS)\l < e whenever s> Q '"sW . £!±£ . iog3(r/e)) . 

The remainder of this Section is organized as follows: In Section 3.1 we prove Theorems 1 and
2 in the case where the sample is drawn with replacement. In Section 3.2 we discuss possible
bounds on the mean-squared-error, as in Theorem 2, but using the trace-norm. In Section 3.3 we
compare sampling with and without replacement, establishing Theorems 1 and 2 also for sampling
without replacement. In Section 3.4 we turn to the setting of independent mean-zero noise, and prove
Theorem 3 in both the sampling-with-replacement and sampling-without-replacement settings.

3.1 Proof of Theorems 1 and 2 when S is drawn with replacement

We first establish the Theorems for a sample chosen i.i.d. with replacements. In this case, following
Srebro and Shraibman (2005), we may view matrix reconstruction as a prediction problem, by
regarding a matrix $X \in \mathbb{R}^{n \times m}$ as a function $[n] \times [m] \to \mathbb{R}$. Each
observation in the training set consists of a covariate $(i,j) \in [n] \times [m]$ and an observed noisy
response $Y_{ij} \in \mathbb{R}$. Here, we assume that the distribution over $[n] \times [m]$ is uniform, and
the joint distribution over $(i,j)$ and its response is determined by the unknown $Y$. The hypothesis
class is then a set of matrices bounded in either trace-norm or max-norm, and for a particular
hypothesis $X \in \mathbb{R}^{n \times m}$, the averaged error $\frac{1}{nm}|Y - X|_1$ or
$\frac{1}{nm}|Y - X|_2^2$ is equal to the expected loss
$L(X) = \mathbb{E}_{ij}\left[ \mathrm{loss}(X_{ij}, Y_{ij}) \right]$ under either the absolute-error or
squared-error loss, respectively.

Srebro and Shraibman (2005) established bounds on the Rademacher complexity of the trace-norm
and max-norm balls. For any sample of size $s$, the empirical Rademacher complexity of the
max-norm ball is bounded by

$$ \hat{\mathcal{R}}_s \left( \{ X \in \mathbb{R}^{n \times m} \mid \|X\|_{\max} \le A \} \right) \le 12 \sqrt{\frac{A^2(n+m)}{s}} . \qquad (8) $$

Although the empirical Rademacher complexity of the trace-norm ball might be fairly high, the
expected Rademacher complexity, for a random sample of $s$ independent uniformly chosen index
pairs (with replacements), can be bounded as

$$ \mathbb{E}\left[ \hat{\mathcal{R}}_s \left( \{ X \in \mathbb{R}^{n \times m} \mid \|X\|_{tr} \le A \} \right) \right] \le K \sqrt{\frac{A^2}{nm} \cdot \frac{(n+m)\log(n)}{s}} \qquad (9) $$

for some numeric constant $K$ (this is a slightly better bound than the one given by Srebro and
Shraibman (2005), and is proved in Appendix B).

Since the absolute error loss, $\mathrm{loss}(x, y) = |x - y|$, is 1-Lipschitz, these Rademacher complexity
bounds immediately imply (Bartlett and Mendelson, 2001):

$$ \mathbb{E}_S\left[ \frac{1}{nm} |Y - \hat{X}(S)|_1 \right] \le \inf_{\|X\|_{\max} \le A} \left( \frac{1}{nm} |Y - X|_1 \right) + 24 \sqrt{\frac{A^2(n+m)}{s}} \qquad (10) $$

for $\hat{X}(S) = \arg\min_{\|X\|_{\max} \le A} \sum_{(i,j) \in S} |Y_{ij} - X_{ij}|$, and:

$$ \mathbb{E}_S\left[ \frac{1}{nm} |Y - \hat{X}(S)|_1 \right] \le \inf_{\|X\|_{tr} \le A} \left( \frac{1}{nm} |Y - X|_1 \right) + 2K \sqrt{\frac{A^2}{nm} \cdot \frac{(n+m)\log(n)}{s}} \qquad (11) $$

for $\hat{X}(S) = \arg\min_{\|X\|_{tr} \le A} \sum_{(i,j) \in S} |Y_{ij} - X_{ij}|$. (For details, see Lemma 7 in
Appendix C.) These provide guarantees on reconstructing matrices with bounded max-norm or
trace-norm. Choosing $A = \sqrt{r}$ for the max-norm and $A = \sqrt{rnm}$ for the trace-norm, Theorem 1
(for sampling with replacement) follows from Equation (2) and Lemma 1. (Remark 1 follows from the
results of Bartlett and Mendelson (2001) with identical arguments for the sampling-with-replacement case.)
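
For a rough sense of scale, the excess-error term of the max-norm bound (10) can be inverted: with $A = \sqrt{r}$, requiring $24\sqrt{r(n+m)/s} \le \epsilon$ gives $s \ge 576\, r(n+m)/\epsilon^2$. The helper below does this arithmetic; its name and the example values are hypothetical, and the explicit constant should be read as a worst-case figure rather than a practical recommendation.

```python
import math

def max_norm_sample_size(r, n, m, eps, const=24.0):
    """Smallest integer s with const * sqrt(r * (n + m) / s) <= eps."""
    return math.ceil(const ** 2 * r * (n + m) / eps ** 2)

# Example: a 100000 x 100000 matrix, rank 2, target excess error 0.5.
n = m = 100_000
s_needed = max_norm_sample_size(r=2, n=n, m=m, eps=0.5)
fraction_observed = s_needed / (n * m)
```

For these values the required sample is a small fraction of the $nm$ entries, reflecting the $r(n+m)$ rather than $nm$ scaling of the bound.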

In order to obtain Theorem 2, we use a recent bound on the excess error with respect to a
smooth (rather than Lipschitz) loss function, such as the squared loss. Specifically, Theorem 1 of
Srebro et al. (2010) states that, for a class of predictors $\mathcal{X} \to [-B, B]$ and a loss function
bounded by $b$ with second derivative bounded by $H$, with probability at least $1 - \delta$ over a random
sample of size $s$,

L{X)<L* +Oy^jL*Tls+Tlsj , (12) 
L* = vaiL(X) , 

X 

■75 u'v^^ sf B \ , b\og{\og{s)/6) 

7^s = i^7^, log 1^1+ > (i3) 

where the infimum is over predictors in the class, X is the empirical error minimizer in the class, 
and TZs is an upper bound on the Rademacher complexity for all samples of size s. 

In our case, for the class {X\ \\X\\^_^_^^^ < A} and the squared loss, we have B = s\rpx sup^ \Xij \ = 

supx l^loo ^ supjf ll^llmax ^ ^ and b = sup^ \X - < ^J^^^-, when we assume \Z\^ < 

^\/ io<y'(n Tm) • ^PP'^y^'^S bound ([5]) on the Rademacher complexity yields: 

jl^^Qf A'^jn + m) 3 / s \ ^ yl^(n + m) log logs ^ A^(n + m) log(l/.5) \ 
\ s \n) s\og{n + ra) slogn / 

Here the last inequality uses the fact that $s \le n^2$, while the next-to-last inequality assumes
$s \ge e^3(n+m)$, and applies the fact that $x \log^3(1/x)$ is an increasing function for $x \le e^{-3}$, where in

this case x ~ \ . 



Remark 3 follows immediately. The first claim in Theorem 2 follows when we assume $|M|_\infty \le 1$
and $\operatorname{rank}(M) \le r$ and set $A = \sqrt{r}$ (since, by Lemma 1, $\|M\|_{\max} \le \sqrt{r}$). If we
instead consider the class $\{X : \|X\|_{\max} \le \sqrt{r},\ |X|_\infty \le 1\}$, then in the notation of (12),
we may define $B = 1$ instead of $B = A = \sqrt{r}$, and thus obtain

$$ \tilde{\mathcal{R}}_s \le O\left( \frac{r(n+m)}{s} \left( \log^3\left( \frac{s}{r(n+m)} \right) + \frac{\log(1/\delta)}{\log n} \right) \right) , \qquad (16) $$

which yields the second claim of Theorem 2.

Finally, we prove the claim in Remark 2. If instead we assume $|Z|_\infty \le \sqrt{r \log n}$, then in the
notation of (12), we may define $b = r \log n$ instead of $b = \frac{r(n+m)}{\log n}$, and thus obtain

$$ \tilde{\mathcal{R}}_s \le O\left( \frac{r(n+m)}{s} \left( \log^3\left( \frac{s}{n+m} \right) + \frac{\log n \cdot \log(1/\delta)}{n+m} \right) \right) . \qquad (17) $$

For $\delta \ge e^{-n/\log n}$, the second term is dominated by the first; therefore the sample complexity no
longer depends on $\beta$.

3.2 Bounds on $\ell_2$ error using the trace norm

In Theorem 1 we saw that for mean-absolute-error matrix reconstruction, using the trace-norm
instead of the max-norm allows us to forgo a bound on the spikiness, and rely only on the average
squared magnitude $\frac{1}{nm}|Y|_2^2$. One might hope that we can similarly get a squared-error
reconstruction guarantee using the trace-norm and without the spikiness bound that was required in
Theorem 2. Unfortunately, this is not possible.

In fact, as the following example demonstrates, it is not possible to reconstruct a low-rank matrix
to within much-better-than-trivial squared-error without a spikiness assumption, and relying only
on $\frac{1}{nm}|Y|_2^2 \le 1$. Specifically, consider an $n \times m$ matrix



$$ Y = \sqrt{\frac{m}{r}} \left( A \mid 0_{n \times (m-r)} \right) $$

where $A \in \{\pm 1\}^{n \times r}$ is an arbitrary sign matrix. The matrix $Y$ has rank at most $r$ and average
squared magnitude $\frac{1}{nm}|Y|_2^2 = 1$ (but maximal squared magnitude $|Y|_\infty^2 = m/r$). Now, with
even half the entries observed (i.e. $s = nm/2$), we have no way of reconstructing the unobserved entries
of $A$, as any values we choose for these entries would be consistent with the rank-$r$ assumption,
yielding an expected average squared error of at least 1/2. We can conclude that regardless of the
estimator, controlling the average squared magnitude is not enough here, and we cannot expect to
obtain a squared-error reconstruction guarantee based on $\frac{1}{nm}|Y|_2^2$, even if we use the trace-norm.
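
The "spiky" construction above can be checked directly. The sketch below builds such a matrix for small, arbitrarily chosen dimensions and confirms that its average squared magnitude is 1 while its maximal squared magnitude is $m/r$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, r = 12, 8, 2
A = rng.choice([-1.0, 1.0], size=(n, r))                  # arbitrary sign matrix
Y = np.sqrt(m / r) * np.hstack([A, np.zeros((n, m - r))])

avg_sq = (Y ** 2).mean()          # (1/nm) |Y|_2^2
max_sq = (Y ** 2).max()           # |Y|_inf^2
rank = np.linalg.matrix_rank(Y)   # at most r, since only r columns are non-zero
```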

We note that if $|M|_\infty, |Y|_\infty = O(1)$, then the squared-loss in the relevant regime has a bounded
Lipschitz constant, and Theorem 1 applies. In particular, if $|M|_\infty, |Y|_\infty \le 1$, then we can consider
the estimator

$$ \hat{X}(S) = \arg\min_{\|X\|_{tr} \le \sqrt{rnm},\ |X|_\infty \le 1} \sum_{(i,j) \in S} (Y_{ij} - X_{ij})^2 . \qquad (18) $$

Since we now only need to consider $X$ where $|X_{ij} - Y_{ij}| \le 2$, the squared-loss in the relevant
domain is 4-Lipschitz. We can therefore use the standard generalization results for Lipschitz loss as
in Theorem 1, and obtain that with high probability over a sample of size

$$ s \ge O\left( \frac{r(n+m)\log n}{\epsilon^2} \right) $$

we have $\frac{1}{nm}|Y - \hat{X}(S)|_2^2 \le \sigma^2 + \epsilon$. However, this result gives a dependence on
$\epsilon$ that is quadratic, as opposed to the more favorable dependence (at least when
$\epsilon = \Omega(\sigma^2)$) of Theorem 2.

We believe that, when $|M|_\infty, |Y|_\infty \le O(1)$, it is possible to improve the dependence on $\epsilon$ to a
dependence similar to that of Theorem 2 (this would require a more delicate analysis than that of
Srebro et al. (2010), as their techniques rely on bounding the worst-case Rademacher complexity).
But even this would not give any advantage over the max-norm, since the bound on $|M|_\infty$ could
not be relaxed, while an additional factor of $\log n$ would be introduced into the sample complexity
(coming from the Rademacher complexity calculation for the trace-norm). It seems then, that at
least in terms of the quantities and conditions considered in this paper, as well as elsewhere in the 
low-rank reconstruction literature we are familiar with, there is no theoretical advantage for the 
trace-norm over the max-norm in terms of squared-error approximate reconstruction, though there 
could be an advantage for the max-norm in avoiding a logarithmic factor. 

3.3 Sampling with or without replacement in Theorems 1 and 2

Theorems [1] and [2] give results that hold for either sampling with replacement or sampling without 
replacement. When an entry of the matrix Y is sampled twice, the same value is observed each time — 
no new information about the matrix is observed, and so intuitively, sampling without replacement 
should yield strictly better results than sampling with replacement. The two lemmas below, proved
in the Appendix, establish that sampling without replacement is indeed at least as good as
sampling with replacement (up to a constant).

Before stating the lemmas, we briefly introduce some notation. Let $L(X)$ denote the loss for an
estimated matrix $X$; that is, $L(X) = \frac{1}{nm}|Y - X|_1$ or $L(X) = \frac{1}{nm}|Y - X|_2^2$, as
appropriate. Let $\hat{L}_S(X)$ denote the empirical loss,
$\hat{L}_S(X) = \frac{1}{s} \sum_{(i,j) \in S} |Y_{ij} - X_{ij}|^p$ (where $p \in \{1, 2\}$ and the sum
includes repeated elements in $S$). Let $\mathcal{D}_w$ and $\mathcal{D}_{w/o}$ denote the distributions of a
sample of size $s$ drawn uniformly at random from the matrix, either with or without replacement, respectively.

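Lemma 2. Let $\mathcal{X}$ denote any class of matrices, with $\mathcal{D}_w$ and $\mathcal{D}_{w/o}$ defined as
above. Then

$$ \mathbb{E}_{S \sim \mathcal{D}_{w/o}} \left[ \sup_{X \in \mathcal{X}} L(X) - \hat{L}_S(X) \right] \le \mathbb{E}_{S \sim \mathcal{D}_w} \left[ \sup_{X \in \mathcal{X}} L(X) - \hat{L}_S(X) \right] . $$

Lemma 3. Let $\mathcal{X}$ denote any class of matrices, with $\mathcal{D}_w$ and $\mathcal{D}_{w/o}$ defined as
above. Then for any $c \in \mathbb{R}$, and for any function $g$,

$$ P_{S \sim \mathcal{D}_{w/o}} \left( \sup_{X \in \mathcal{X}} g(L(X)) - \hat{L}_S(X) > c \right) \le 4s \cdot P_{S \sim \mathcal{D}_w} \left( \sup_{X \in \mathcal{X}} g(L(X)) - \hat{L}_S(X) > c \right) . $$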

For the $\ell_1$-loss case, the Rademacher bounds (10) and (11) are derived from Bartlett and Mendelson
(2001) by bounding $\mathbb{E}_{S \sim \mathcal{D}_w} \left[ \sup_{X \in \mathcal{X}} L(X) - \hat{L}_S(X) \right]$
(or $P_{S \sim \mathcal{D}_w} \left( \sup_{X \in \mathcal{X}} L(X) - \hat{L}_S(X) > c \right)$, for
the proof of Remark 1). By Lemma 2, the same bound then holds for the same expectation taken
over $S \sim \mathcal{D}_{w/o}$, and therefore (10) and (11) must hold for this case as well. This implies that
the results of Theorem 1 (and Remark 1) hold for sampling without replacement as well as sampling
with replacement.

Similarly, for the $\ell_2$-loss case, the Rademacher bound (12) is derived in Srebro et al. (2010) by
bounding $\sup_{X \in \mathcal{X}} L(X) - \sqrt{a \cdot L(X)} - \hat{L}_S(X)$ for some constant $a$, with
probability at least $1 - \delta$ over $S \sim \mathcal{D}_w$. Defining $g(L) = L - \sqrt{a \cdot L}$, Lemma 3 shows
that the same bound must therefore hold with probability at least $1 - 4s\delta \ge 1 - 4n^2\delta$ over
$S \sim \mathcal{D}_{w/o}$, and therefore (12) holds for this case also. This implies
that the results of Theorem 2 (and the subsequent remarks) hold for sampling without replacement
as well as sampling with replacement.

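3.4 Proof of Theorem 3: independent errors in the $\ell_2$-loss setting

First, we prove the theorem when sampling with replacement. For a matrix $X$, let $L(X)$ denote the
expected squared error for a randomly sampled entry, that is,

$$ L(X) = \frac{1}{nm} \sum_{(i,j)} \mathbb{E}_{Z \sim \mathcal{D}_{(i,j)}} \left[ (M_{ij} + Z - X_{ij})^2 \right] = \frac{1}{nm} \sum_{(i,j)} \mathbb{E}_{Z \sim \mathcal{D}_{(i,j)}} \left[ Z^2 \right] + \frac{1}{nm} |M - X|_2^2 . $$

Now write $\sigma^2 = \frac{1}{nm} \sum_{(i,j)} \mathbb{E}_{Z \sim \mathcal{D}_{(i,j)}} \left[ Z^2 \right]$. Then $L(M) = \sigma^2$.

Then, for any sample $S$, given $\hat{X}(S)$ which is a random matrix depending on some observed
sample, the expected loss (over a future observation of an entry in the matrix) of $\hat{X}(S)$ satisfies
the following (due to the fact that noise in a future observation of the matrix has zero mean and is
independent from $\hat{X}(S)$):

$$ L(\hat{X}(S)) = \mathbb{E}\left( (Y_{ij} - \hat{X}(S)_{ij})^2 \mid \hat{X}(S) \right) = \mathbb{E}\left( (Z + M_{ij} - \hat{X}(S)_{ij})^2 \mid \hat{X}(S) \right) $$

$$ = \frac{1}{nm} \sum_{(i,j)} \mathbb{E}_{Z \sim \mathcal{D}_{(i,j)}} \left( Z^2 + (M_{ij} - \hat{X}(S)_{ij})^2 \mid \hat{X}(S) \right) = \sigma^2 + \frac{1}{nm} |M - \hat{X}(S)|_2^2 . $$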
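
The decomposition of the expected loss into a noise term plus a reconstruction term can be verified exactly for a simple zero-mean noise model. In the sketch below, the noise at entry $(i,j)$ is assumed (for illustration only) to take the values $+a_{ij}$ or $-a_{ij}$ with probability 1/2 each, so the expectation is an average over two outcomes and no sampling is needed.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 5, 4
M = rng.uniform(-1, 1, size=(n, m))   # underlying matrix
X = rng.uniform(-1, 1, size=(n, m))   # any fixed estimate
a = rng.uniform(0, 2, size=(n, m))    # per-entry noise scale (assumed model)

# Exact expected squared error of X against a fresh noisy observation.
expected_loss = 0.5 * ((M + a - X) ** 2 + (M - a - X) ** 2).mean()

sigma_sq = (a ** 2).mean()            # average noise variance
recon_err = ((M - X) ** 2).mean()     # (1/nm) |M - X|_2^2
```

The cross term vanishes because the noise has zero mean at every entry, which is the only property of the noise the argument uses.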

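Therefore, following the same reasoning as the proof of Theorem 2 (and Remark 2), we have that
if $s \ge O\left( \frac{r(n+m)}{\epsilon} \cdot \frac{\epsilon + \sigma^2}{\epsilon} \cdot \log^3(r/\epsilon) \right)$,
then with high probability,

$$ L(\hat{X}) \le \sigma^2 + \epsilon . $$

Applying the work above, we obtain

$$ \frac{1}{nm} |M - \hat{X}(S)|_2^2 \le \epsilon . \qquad (20) $$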

Now we turn to sampling without replacement. We first state a lemma which is proved in the
appendix. (Notation: here $\mathcal{D}_w$ and $\mathcal{D}_{w/o}$ again denote sampling with or without
replacement, but in this context $\mathcal{D}_w$ represents sampling with replacement when the noise is added
independently each time an entry is sampled, as in the statement of Theorem 3.)

Lemma 4. Let X denote any class of matrices, with V and 2?^^^ defined as above. For any c, if 
s satisfies s < ^'^^ (nm) , then 

Ps^v^ , (sup g{L{X)) - Ls{X) > < 4if • Pg^v^ fsup 9{L{X)) - Ls{X) > {2K)- 



As in the proof of the sampling- without-replacement case of Theorem [21 this is sufficient to show 
that —\M — X(5)|2 < • e with high probability for the stated sample complexity, as long as we 

nm ^ ^ 

also have that s < ^^-^{nm) . 



4 Comparison to prior work

Suppose $Y = M + Z$ where $\operatorname{rank}(M) \le r$ and $Z$ is a "noise" matrix of average squared
magnitude $\sigma^2 = \frac{1}{nm}|Z|_2^2$, and we observe random entries of $Y$. One might then consider
different types of reconstruction guarantees, requiring different assumptions on $M$, $Z$ and the
sampling distribution:

Exact recovery of $M$: $\hat{X}(S) = M$.

Near-exact recovery of $M$: $\frac{1}{nm}|\hat{X}(S) - M|_2^2 \le \epsilon \cdot \sigma^2$.

Approximate recovery of $M$: $\frac{1}{nm}|\hat{X}(S) - M|_2^2 \le \epsilon \cdot \mathrm{scale}(M)$.

Approximate recovery of $Y$: $\frac{1}{nm}|\hat{X}(S) - Y|_2^2 \le \sigma^2 + \epsilon \cdot \mathrm{scale}(M)$.
Exact or near-exact recovery requires strong incoherence-type assumptions on the matrix $M$,
and is not possible for arbitrary low-rank matrices (see, e.g. Candès and Recht (2009)). Here we
do not make any such assumptions, and show that approximate recovery is still possible. Such
approximate recovery must be relative to some measure of the scale of $M$, and we discuss results
relative to both the maximal magnitude, $\mathrm{scale}(M) = |M|_\infty^2$, and the average squared magnitude,
$\mathrm{scale}(M) = \frac{1}{nm}|M|_2^2$. Although not actually guaranteeing the same type of "recovery", in
Section 4.2 we nevertheless compare the sample complexity required for our approximate recovery
results to the best sample complexity guarantee for exact and near-exact recovery (obtained by
Recht (2009) and Keshavan et al. (2010), respectively), and comment on the differences between the
required assumptions on $M$.

More directly comparable to our results are recent results by Keshavan et al. (2010), Negahban and
Wainwright (2010) and Koltchinskii et al. (2010) on approximate recovery of $M$. These give essentially
the same type of guarantee as in Theorem 3, and also rely on $|M|_\infty^2$ as a measure of scale. In
Section 4.1 we compare our guarantee to these results, discussing the different dependence on the
various parameters and different assumptions on the noise. (Note that both types of results appear in
Keshavan et al. (2010): in Section 4.1 we refer to the approximate recovery result stated in Theorem 1.1
of their paper, while in Section 4.2 we refer to the near-exact recovery result stated in Theorem 1.2 of
their paper.)

Recovery of $M$, whether exact, near-exact, or approximate, also requires the noise to be
independent and zero-mean, otherwise $M$ might not be identifiable. All prior matrix reconstruction
results we are aware of work in this setting. Approximate recovery of $M$ also immediately implies an
excess error bound on approximate recovery of $Y$. However, we also provide excess error bounds for
approximate recovery of $Y$ that do not assume independent nor zero-mean noise (Theorems 1 and
2). That is, we provide reconstruction guarantees in a significantly less restrictive setting compared
to other matrix reconstruction guarantees.

Another difference between different results is whether entries are sampled with or without
replacement, and if replacement is allowed, whether the error is per-entry (i.e. repeat observations
of the same entry are identical) or per-observation (i.e. repeat observations of the same entry are
each corrupted independently). However, as we show in Sections 3.3 and 3.4, and as has also been
shown for exact recovery (Recht, 2009), these differences do not significantly alter the quality of
reconstruction or the required sample size.

The most common algorithm for low-rank matrix recovery in the literature is squared-error
minimization subject to a penalty on trace norm. All the methods cited here prove results about
some variation of this approach, with the exception of a recent result by Keshavan et al. (2010),
which applies to the output of the local search procedure OptSpace. In contrast, our results are
mostly for error minimization subject to a max-norm constraint.

4.1 Comparison With Recent Approximate Recovery Guarantees 

Negahban and Wainwright (2010) and Koltchinskii et al. (2010) recently presented guarantees on 
approximate recovery using trace-norm regularization, in a setting very similar to our Theorem 3. 
Earlier work by Keshavan et al. (2010) uses a low-rank SVD approximation to $\tilde{Y}_S$ in the same setting 
to also obtain an approximate recovery guarantee. (Here $Y_S$ is the matrix consisting of all observed 
entries of Y, with zeros elsewhere, and $\tilde{Y}_S$ is the same matrix with overrepresented rows and columns 
removed.) In particular, each of the three guarantees provides an $\epsilon$-approximate reconstruction of M 
relative to $|M|_\infty$. That is, when $|M|_\infty \le 1$ as in Theorem 3, they provide the exact same guarantee 
$\frac{1}{nm}|\hat{X}(S) - M|_2^2 \le \epsilon$. (Negahban and Wainwright state the result relative to $\frac{1}{nm}|M|_2^2$ but have a 
linear dependence on the "spikiness" $\sqrt{nm}\,|M|_\infty / |M|_2$, effectively giving a guarantee relative to $|M|_\infty$.) 

Specifically, assuming $|M|_\infty = 1$ without loss of generality, Negahban and Wainwright and 
Koltchinskii et al. assume the noise is independent and subgaussian (or subexponential) with 
variance $O(\sigma^2)$, and require a sample size of 

$$s \ge O\left(\frac{rn\log(n)}{\epsilon}\cdot(1+\sigma^2)\right), \tag{21}$$



where the sample is drawn with replacement; in particular, an entry of the matrix which is 
sampled multiple times gives multiple independent estimates of $M_{ij}$. 

Keshavan et al. give a result on approximate recovery which holds with no assumption on the 
noise, but requires additional assumptions such as i.i.d. noise to be a meaningful bound. The 
estimator used is the rank-r SVD approximation to $\tilde{Y}_S$, defined above. Specifically, they show that, for 
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sufficiently large sample size, with high probability, -^\X{S) - Af I2 < O (^ "''^ " + ^ll^slll j > 

where $\tilde{Z}_S$ is defined in the same way as $\tilde{Y}_S$. For this bound to be meaningful, there must be 
some distributional assumption on Z; otherwise, we could have $\|\tilde{Z}_S\|_2 \approx |Z_S|_2 = O(\sqrt{s})$, and the 
bound on mean error would actually increase with n, and is thus not a meaningful bound. In 
the presence of i.i.d. subgaussian noise, however, Keshavan et al. show that with high probability, 

IIZslll < _ Using this, approximate recovery of M is obtained for sample complexity 

5>o(^-(V^)-(l + log(n)a2)) , (22) 

where the sample is drawn without replacement. Therefore we may regard the result of Keshavan et al. as 
bounding error under the assumption of i.i.d. subgaussian noise (or perhaps some weaker assumption 
that gives the same result, such as independent subgaussian noise that might not be i.i.d.). 
The guarantees (22) and (21) are therefore quite similar, even though they are for fairly different 
methods, with (22) being better when $\sigma^2 = o(1)$ but worse for highly rectangular matrices. 
Comparing our Theorems 2 and 3 to the above, the advantages of our results are: 

• We avoid the extra logarithmic dependence on n. 

• Even in order to guarantee recovery of M, we assume only a much milder condition on the 
noise: that noise is mean-zero, and that with high probability, |Zs|oo < yi^^- '^^ ^^t 

assume the noise is identically distributed, nor subgaussian or subexponential. 

• We provide a guarantee on the excess error of recovering Y, even when the noise is neither 
zero-mean nor independent. 



The deficiency of our result is a possibly slower rate of error decrease: when $\sigma > 0$ and $\epsilon = 
o(\sigma^2)$ (i.e. to get "estimation error" significantly lower than the "approximation error"), our sample 
complexity scales as $O(1/\epsilon^2)$ compared to just $O(1/\epsilon)$ in the other results. We do not know if 
this difference represents a real consequence of not assuming zero-mean independent noise in our 
analysis, or just looseness in the proof. Our results also include an additional $\log^3(1/\epsilon)$ factor, which 
we believe is purely an artifact of the proof technique. 

A strength of our analysis, as compared to that of Negahban and Wainwright and Koltchinskii et al., 
is that the cases of sampling with and without replacement are both covered, including the case of 
per-entry noise when sampling with replacement, while the results of Negahban and Wainwright 
and Koltchinskii et al. are for sampling with replacement with per-observation noise. This is an 
important improvement because in many applications, the observed entries are drawn from a fixed 
matrix which was randomly generated, meaning that it is not possible to obtain multiple independent 
observations of any $M_{ij}$. 



4.2 Comparison of results on exact and near-exact recovery 

The results of Recht and of Keshavan et al. show that exact or near-exact recovery of the underlying 
low-rank matrix M can be obtained with high probability, when strong conditions on M are assumed, 
and when the observations are either noiseless (for Recht's exact recovery result) or are corrupted 
by i.i.d. subgaussian noise (for the near-exact recovery result of Keshavan et al.). 

These results cannot be directly compared to the results we obtain in this paper, because the 
guarantees on recovery given by these works and by our work are fundamentally different. For instance, 
the error bound $\epsilon$ has completely different meanings in our definitions of near-exact recovery and 
approximate recovery above. These two incomparable types of guarantees are linked to very different 
conditions on the data: exact and near-exact recovery cannot be obtained without strict assumptions 
about how the observations are generated. 

Nonetheless, one comparison that can be made between these methods is in the magnitude of 
the required sample complexities to obtain some meaningful bound via each result: exact recovery 
for Recht's result, near-exact recovery for the result of Keshavan et al., and approximate recovery relative 
to $|M|_\infty$ for our result. The rest of this section is organized as follows: we summarize the results 
in the literature in Section 4.2.1, compare sample complexities in Section 4.2.2, and describe how 
incoherence is sufficient but not necessary for approximate recovery relative to $|M|_2$ (instead of 
$|M|_\infty$) in Sections 4.2.3 and 4.2.4. 
4.2.1 Details on exact and near-exact results in the literature 

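Let $M = U\Sigma V^T$ be a reduced SVD of M. Let $\kappa$ be the condition number of $\Sigma$. Define also the 
incoherence parameters for the matrix M (Candès and Recht, 2009): 

$$\mu_0 = \max\left\{\frac{n}{r}\cdot\max_i |U_{(i)}|_2^2,\ \frac{m}{r}\cdot\max_j |V_{(j)}|_2^2\right\}, \qquad \mu_1 = \sqrt{\frac{nm}{r}}\cdot\max_{i,j}\left|(UV^T)_{ij}\right|,$$

where $U_{(i)}$ denotes the ith row of U and $V_{(j)}$ denotes the jth row of V. 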
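
As a minimal sketch of how these quantities behave, the following computes incoherence parameters from the row lists of U and V. It assumes the standard Candès and Recht normalization, $\mu_0 = \max\{\frac{n}{r}\max_i |U_{(i)}|_2^2, \frac{m}{r}\max_j |V_{(j)}|_2^2\}$ and $\mu_1 = \sqrt{nm/r}\,\max_{i,j}|(UV^T)_{ij}|$; the function name `incoherence` is hypothetical.

```python
import math

def incoherence(U, V):
    """U: n rows of length r, V: m rows of length r.

    Columns of U and V are assumed orthonormal (a reduced SVD);
    this function does not check that assumption.
    """
    n, m, r = len(U), len(V), len(U[0])
    max_u = max(sum(x * x for x in row) for row in U)  # largest squared row norm of U
    max_v = max(sum(x * x for x in row) for row in V)  # largest squared row norm of V
    mu0 = max(n / r * max_u, m / r * max_v)
    # largest entry of U V^T in absolute value
    max_uv = max(abs(sum(a * b for a, b in zip(u, v))) for u in U for v in V)
    mu1 = math.sqrt(n * m / r) * max_uv
    return mu0, mu1

# A perfectly flat singular vector gives the smallest possible values,
# while a singular vector concentrated on one coordinate gives the largest.
flat = [[0.5], [0.5], [0.5], [0.5]]
spiky = [[1.0], [0.0], [0.0], [0.0]]
print(incoherence(flat, flat))
print(incoherence(spiky, spiky))
```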

Suppose that M has low incoherence parameters and Z = 0. Improving on the earlier results of 
Candès and Recht (2009) and Candès and Tao (2010), Recht proves that $\hat{X}(S) = M$ (that is, exact 
recovery is obtained) with high probability if 

$$s \ge O\left(rn\max\{\mu_0, \mu_1^2\}\log^2 n\right). \tag{23}$$

In the case of noisy observations, Keshavan et al. give conditions on low $\ell_2$ error in recovery 
(with high probability) in the setting of i.i.d. subgaussian noise with incoherent M, improving on 
the earlier work of Candès and Plan (2010) on the noisy case. (More precisely, Keshavan et al. give a result 
which holds with no assumption on the noise, but requires additional assumptions such as i.i.d. noise 
to be a meaningful bound. We therefore regard their result as assuming i.i.d. subgaussian noise; see 
the discussion of their approximate reconstruction result above in Section 4.1.) Their OptSpace 
algorithm is a method for finding the rank-r matrix X minimizing squared error on the observed 
entries. Let $\hat{X}(S)$ denote the matrix recovered by this algorithm. When the entries of Z are i.i.d. 
subgaussian, Keshavan et al. show that, with high probability, if s satisfies 

s > O ^rTiK** • max |i log ^^^^^^^ , rK^/Xg,rK^/ij|^ , (24) 

then 1^(5) — M|| < • e. (For simplicity of the comparison, we use a slightly relaxed form of 
their required sample complexity, and ignore $\sqrt{n/m}$ in their error and sample bounds.) 

4.2.2 Comparing sample complexities 

Ignoring the dependence on $\epsilon$, which as we discussed earlier is in any case incomparable between 
approximate and exact or near-exact recovery, our sample complexity for approximate recovery 
using the max-norm is $O(rn)$. Even with "perfect" incoherence parameters, this is a factor of $\log^2(n)$ 
less than the sample complexity established by Recht for exact recovery and a factor of r less 
than the sample complexity established by Keshavan et al. for near-exact recovery. Of course, 
"bad" incoherence parameters may sharply increase the sample complexity for exact or near-exact 
recovery, but do not affect our sample complexity for approximate recovery. 

4.2.3 Approximate recovery relative to average signal magnitude, in the presence of 
incoherence conditions 

It is interesting to note that the incoherence assumptions, used by Recht and by Keshavan et al., 
enable approximate recovery with the max-norm relative to the average magnitude $|M|_2$, and 
not only the maximal magnitude, as in Theorem 2. This is based on the following observation: 

Lemma 5. Let $M \in \mathbb{R}^{n\times m}$ and let $\kappa$ and $\mu_0$ be defined as before. Then 

$$\|M\|_{\max} \le \min\{\kappa, \sqrt{r}\}\,\mu_0\sqrt{r}\cdot\frac{|M|_2}{\sqrt{nm}}.$$

In particular, by Lemma{^ the above expression is also an upper bound for \M\ao- 
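Proof. First, observe that 

$$\|M\|_{\max} \le \max_{i,j}\,|(U\Sigma)_{(i)}|_2\cdot|V_{(j)}|_2 \le \sigma_1\cdot\max_{i,j}\,|U_{(i)}|_2\cdot|V_{(j)}|_2 \le \sigma_1\cdot\frac{\mu_0 r}{\sqrt{nm}}.$$

Also, 

$$\sigma_1 \le \kappa\sqrt{\sigma_r^2} \le \frac{\kappa}{\sqrt{r}}\sqrt{\sigma_1^2+\dots+\sigma_r^2} = \frac{\kappa}{\sqrt{r}}|M|_2 \quad\text{and}\quad \sigma_1 \le \sqrt{\sigma_1^2+\dots+\sigma_r^2} = |M|_2.$$

□ 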
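
As a sanity check on Lemma 5, the rank-one case $M = \sigma u v^T$ (so $r = 1$, $\kappa = 1$, $|M|_2 = \sigma$) can be verified directly. The sketch below reads the lemma as $\|M\|_{\max} \le \min\{\kappa,\sqrt{r}\}\,\mu_0\sqrt{r}\,|M|_2/\sqrt{nm}$ and checks only the weaker consequence that this expression also upper-bounds the largest entry magnitude $|M|_\infty$. The function names are hypothetical, and u and v are assumed to be unit vectors.

```python
import math

def lemma5_bound_rank1(u, v, sigma):
    """Right-hand side of Lemma 5 for M = sigma * u v^T with unit vectors u, v."""
    n, m = len(u), len(v)
    # mu_0 with r = 1: the rows of U and V are the individual coordinates
    mu0 = max(n * max(x * x for x in u), m * max(y * y for y in v))
    # r = 1 and kappa = 1, so min{kappa, sqrt(r)} * mu_0 * sqrt(r) = mu_0
    return mu0 * sigma / math.sqrt(n * m)

def max_abs_entry_rank1(u, v, sigma):
    """Largest entry magnitude of sigma * u v^T."""
    return sigma * max(abs(x) for x in u) * max(abs(y) for y in v)

u = [0.6, 0.8]
v = [1 / math.sqrt(3)] * 3
print(max_abs_entry_rank1(u, v, 2.0) <= lemma5_bound_rank1(u, v, 2.0))
```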
Now, based on Remark 3, if 
over a sample of size 



1 

nrn 



< 1 (and with a mild bound on I^Iq^), with high probability 



s > O ( — 



^.min{«^r}^^log3(^ 



(25) 



we have $\frac{1}{nm}|Y - \hat{X}(S)|_2^2 \le \sigma^2 + \epsilon$. Up to log factors and the dependence on $\epsilon$, this sample complexity 
is at most as much as the sample complexity required by Keshavan et al., given in (24). 



4.2.4 Approximate recovery relative to average signal magnitude, in the absence of 
incoherence conditions 

We make note of several special cases where using the max-norm and the concentration result, and 
bounding excess error relative to $|M|_2$, may compare more favorably to other methods than the 
results above would indicate. 

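• If U = V (that is, M is symmetric), then $\mu_1 = \mu_0\sqrt{r}$ and so our sample complexity compares 
more favorably to the sample complexities obtained by Recht and Keshavan et al. (which both 
involve $\mu_1$). 

• Our sample complexity uses Lemma 5 to bound $\|M\|_{\max}$ in terms of $|M|_2$. An example 
where $\kappa = 1$ and $\|M\|_{\max} \ll \mu_0\sqrt{r}\cdot\frac{|M|_2}{\sqrt{nm}}$ (and thus the bound in Lemma 5 is extremely loose) is the case 
where the spiky columns of U do not align with the spiky columns of V, for example writing 
n = m = N + 1 we have: 

$$M = \begin{pmatrix} 1 & 0 \\ 0 & N^{-1/2} \\ \vdots & \vdots \\ 0 & N^{-1/2} \end{pmatrix} \begin{pmatrix} N^{-1/2} & 0 \\ \vdots & \vdots \\ N^{-1/2} & 0 \\ 0 & 1 \end{pmatrix}^T = \begin{pmatrix} N^{-1/4} & 0 \\ 0 & N^{-1/4} \\ \vdots & \vdots \\ 0 & N^{-1/4} \end{pmatrix} \begin{pmatrix} N^{-1/4} & 0 \\ \vdots & \vdots \\ N^{-1/4} & 0 \\ 0 & N^{-1/4} \end{pmatrix}^T.$$

Since the left-hand factorization is an SVD of M (omitting $\Sigma = I_2$), we therefore have $\mu_0\sqrt{r}\cdot\frac{|M|_2}{\sqrt{nm}} = 1$, 
while the right-hand factorization shows that $\|M\|_{\max} \le N^{-1/2}$. 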

• Large condition numbers $\kappa$ can often lead to the same situation, in which the max-norm is far 
lower than the bound implied by Lemma 5. For example, if low-rank M is a matrix where 

, but if we perturb M slightly and add an extremely low singular value, 

then $\kappa$ becomes extremely high while $\|M\|_{\max}$ is only slightly perturbed. 

5 Summary 

We presented low-rank matrix reconstruction guarantees based on an existing analysis of the 
Rademacher complexity of low trace-norm and low max-norm matrices, and carefully compared 
these to other recently presented results. We view the main contributions of this paper as: 

• Following a string of results on low-rank matrix reconstruction, showing that an existing 
Rademacher complexity analysis, combined with simple arguments on the relationship between 
the rank, max-norm, and trace-norm, can yield guarantees that are in several ways better and 
rely on weaker assumptions. 

• Pointing out that the max-norm can yield superior reconstruction guarantees over the more 
commonly used trace-norm. 

• Studying the issue of sampling with and without replacement, and establishing rigorous generic 
results relating the two settings. This has been done before for exact recovery (Recht, 2009), 
but is done here for the more delicate situation of approximate recovery of either M or Y. 

The main deficiency of our approach is a worse dependence on the approximation parameter $\epsilon$, when 
$\sigma > 0$ (i.e. the approximately low-rank case) and $\epsilon = o(\sigma^2)$ (i.e. estimation error less than 
approximation error). Although this dependence is tight for general classes with bounded Rademacher 
complexity, we do not know if it can be improved in Theorem 2. In particular, we do not know 
whether the less favorable dependence is a consequence of not relying on zero-mean i.i.d. noise, or 
not relying on M having low rank (instead of only assuming low max-norm), or of relying only on 
the Rademacher complexity of the class of low max-norm matrices. Perhaps better bounds can be 
obtained with a more careful analysis. 
A Proof of Sampling-Without-Replacement Lemmas 

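Proof. (Lemmas 2 and 3). Let $\mathcal{S}_r = \{S \in \mathcal{X}^s : \text{each } x \in \mathcal{X} \text{ appears at most } r \text{ times in } S\}$. Let 
$S \sim \mathcal{D}^s_r$ denote a sample S drawn uniformly from $\mathcal{S}_r$. In particular, $\mathcal{D}^s_1 = \mathcal{D}^s_{w/o}$ and $\mathcal{D}^s_s = \mathcal{D}^s$. By 
Lemma 6 (proved below), for any r, 

$$E_{S\sim\mathcal{D}^s_{w/o}}\Big[\sup_{h\in\mathcal{H}} L(h) - \hat{L}_S(h)\Big] \le E_{S\sim\mathcal{D}^s_r}\Big[\sup_{h\in\mathcal{H}} L(h) - \hat{L}_S(h)\Big],$$

$$P_{S\sim\mathcal{D}^s_{w/o}}\Big(\sup_{h\in\mathcal{H}} g(L(h)) - \hat{L}_S(h) > c\Big) \le r!\cdot P_{S\sim\mathcal{D}^s_r}\Big(\sup_{h\in\mathcal{H}} g(L(h)) - \hat{L}_S(h) > c\Big).$$

Taking the first inequality with r = s, this completes the proof for Lemma 2. 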

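Now we complete the proof of Lemma 3. Take $S \sim \mathcal{D}^s$ and write $S = (e_1, \dots, e_s)$. For any 
$i_1 < i_2 < \dots < i_{K+1}$, 

$$P\left(e_{i_1} = e_{i_2} = \dots = e_{i_{K+1}}\right) = \frac{1}{(nm)^K},$$

and so for any K with $(K+1)! > 2s$, the probability that any entry of the matrix appears at least 
$(K+1)$ times in S is bounded by 

$$\binom{s}{K+1}\cdot\frac{1}{(nm)^K} \le \frac{s^{K+1}}{(K+1)!\,(nm)^K} \le \frac{s}{(K+1)!} < \frac{1}{2},$$

where the middle step uses $s \le nm$. Fix the smallest K such that $(K+1)! > 2s$. This implies $K! \le 2s$. We then have 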



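$$P_{S\sim\mathcal{D}^s_{w/o}}\Big(\sup_{h\in\mathcal{H}} g(L(h)) - \hat{L}_S(h) > c\Big) \le K!\cdot P_{S\sim\mathcal{D}^s_K}\Big(\sup_{h\in\mathcal{H}} g(L(h)) - \hat{L}_S(h) > c\Big)$$

$$\le K!\cdot\Big(P_{S\sim\mathcal{D}^s}\big(\text{each } x\in\mathcal{X} \text{ appears at most } K \text{ times in } S\big)\Big)^{-1}\cdot P_{S\sim\mathcal{D}^s}\Big(\sup_{h\in\mathcal{H}} g(L(h)) - \hat{L}_S(h) > c\Big)$$

$$\le 2K!\cdot P_{S\sim\mathcal{D}^s}\Big(\sup_{h\in\mathcal{H}} g(L(h)) - \hat{L}_S(h) > c\Big) \le 4s\cdot P_{S\sim\mathcal{D}^s}\Big(\sup_{h\in\mathcal{H}} g(L(h)) - \hat{L}_S(h) > c\Big).$$

This completes the proof for Lemma 3. 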

□ 
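
The choice of K in the argument above is easy to check numerically. The following sketch finds the smallest K with $(K+1)! > 2s$ and confirms the consequences used in the proof, namely $K! \le 2s$ and hence $2K! \le 4s$; the function name `smallest_K` is hypothetical.

```python
import math

def smallest_K(s):
    """Smallest K >= 0 with (K + 1)! > 2 * s."""
    K = 0
    while math.factorial(K + 1) <= 2 * s:
        K += 1
    return K

for s in (1, 10, 100, 10**6):
    K = smallest_K(s)
    # minimality of K gives K! <= 2s, so the factor 2 * K! is at most 4s
    print(s, K, math.factorial(K) <= 2 * s, 2 * math.factorial(K) <= 4 * s)
```

Since factorials grow quickly, K grows very slowly with s, which is why the final factor can be stated simply as 4s.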

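Lemma 6. Using the notation of the proof above, for any r, 

$$E_{S\sim\mathcal{D}^s_{w/o}}\Big[\sup_{h\in\mathcal{H}} L(h) - \hat{L}_S(h)\Big] \le E_{S\sim\mathcal{D}^s_r}\Big[\sup_{h\in\mathcal{H}} L(h) - \hat{L}_S(h)\Big],$$

$$P_{S\sim\mathcal{D}^s_{w/o}}\Big(\sup_{h\in\mathcal{H}} g(L(h)) - \hat{L}_S(h) > c\Big) \le r!\cdot P_{S\sim\mathcal{D}^s_r}\Big(\sup_{h\in\mathcal{H}} g(L(h)) - \hat{L}_S(h) > c\Big).$$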



Proof. Write $\Omega = [n]\times[m]$. Let $\alpha(S)$ be any function of the sample S, where S may contain repeated 
entries. Assume that, for any $S, S_1, \dots, S_r$ of equal size such that $r\cdot S = S_1 + \dots + S_r$, $\alpha(\cdot)$ satisfies 
the following for some function $a(r)$: 

$$a(r)\cdot\alpha(S) \le \sum_{i=1}^{r}\alpha(S_i). \tag{26}$$

Consider all samples from $\Omega$, drawn with replacement. For a sample set S of size s, for i = 
1, . . . , s, let $N_i(S)$ equal the number of elements of $\Omega$ appearing exactly i times in S, which obeys 
$\sum_i i N_i(S) = s$. We call $N(S) = (N_1(S), \dots, N_s(S))$ the multiplicity vector of S; note that, when 
convenient, we might write $N(S)$ to have length greater than s (filling the last terms with zeros). 
From this point on, we will regard these samples as ordered lists, and assume that in any sample, S 
is ordered in the format 

$$\left(\omega^1_1, \dots, \omega^1_{N_1(S)},\ \omega^2_1, \omega^2_1, \dots, \omega^2_{N_2(S)}, \omega^2_{N_2(S)},\ \omega^3_1, \omega^3_1, \omega^3_1, \dots\right),$$

where for any i we might permute the $\omega^i$'s. 

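Let N be any multiplicity vector, of the form $(N_1, \dots, N_r, 0, \dots, 0)$ for some $r \le s$. Let N' and 
N'' be multiplicity vectors derived from N as follows: 

$$N'_i = \begin{cases} N_1 + rN_r, & i = 1 \\ N_i, & 1 < i < r \\ 0, & i \ge r \end{cases} \qquad N''_i = \begin{cases} N_i, & 1 \le i \le r-1 \\ 0, & i \ge r \end{cases}$$

Define $s = \sum_i iN_i$. Note that $\sum_i iN'_i = s$ and $\sum_i iN''_i = s - rN_r$. 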

Let S = {5 : N(S') = N}, §' = {S : N{S) = N'}, §" = {5 : N{S) = N"}. We wih first prove 
that Es'^umfiS') [o^{S')] < Es^unif{S) [(^(S)], and then induct on r. 

First consider §'. We have 

\S'\Es>^Umm) HS')] = J2 HS')] = E E ["(^" + ^1 + • • • + Ar)] ■ 

S'eS' S"es" ^i,...,Arcn\s" 

Aj^s disjoint 

The last equality arises when, starting with some S' £ S', we recall that S" is an ordered sample set 
beginning with the A^i + rNr elements which appear exactly once. Let S" be the first iVi elements 
of S' , then let Ai be the next elements of S' , let A2 be the next Nr elements of S' , etc. 
Next consider S. As before, we have 

ms^umm Hs)] ^ E ["(^)] = E E ["(^" + ■ ^)] • 

ses S"€S" Acn\s 

By counting how many times each choice of A appears in the sum below, and then rescaling accord- 
ingly, we get 

E E EK5"+r.A,)] 

S"es" Ai,...,A^cn\s" 3 



{nm-Ni Nr)! 

{nm - Ni Nr-i - rTV^)! 



Aj 's disjoint 



{nm-Ni Nr)l \ ^ a{r) 



^E E aiS" + M + ...+Ar). 



(nm-Ni Nr-i-rNr)lJ r 

^ r i ry / S"&" Ai,-,A,Cfl\S" 

\A,\ = N^ 
Aj 's disjoint 

To summarize so far, we have 

' \ [nm — I\i — ■ ■ ■ — Nr-i — rNr)\ J r ^ ' 

Next, we see that (since sample sets are treated as ordered) 



' ' {nm-Ni iV^)!' ' ' {nm-Ni Nr-i-rNr)\ 

Therefore, 

Esr^UnifiS) HS)] > ■ Es'^UnifiS') HS')] ■ 

By inducting over r, we then see that 

Es^umfis) HS)] > ^^^^p^ • ^^s~d:,,„ HS)] , 
where § = {S* : N(5) = N} for any muhiplicity vector N = (A^i, . . . , A^^^, 0, . . . , 0). Therefore, 

Es^v^^ HS)] > n^^i^ • Es^v^ , HS)] , 

Finally, we observe that if $\alpha(S) = \sup_{h\in\mathcal{H}} L(h) - \hat{L}_S(h)$, then $\alpha(S)$ satisfies (26) with $a(r) = r$, 
while if $\alpha(S) = \mathbb{I}\{\sup_{h\in\mathcal{H}} g(L(h)) - \hat{L}_S(h) > c\}$, then $\alpha(S)$ satisfies (26) with $a(r) = 1$. This 
concludes the proof. 

□ 
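
The multiplicity vector used throughout this proof is straightforward to compute. A minimal sketch, with the hypothetical function name `multiplicity_vector`, where a sample is a list of index pairs that may contain repeats:

```python
from collections import Counter

def multiplicity_vector(sample):
    """Return [N_1, ..., N_s], where N_i is the number of distinct
    entries that appear exactly i times in the sample of size s."""
    s = len(sample)
    counts = Counter(sample)         # entry -> number of appearances
    hist = Counter(counts.values())  # multiplicity i -> number of entries
    return [hist.get(i, 0) for i in range(1, s + 1)]

sample = [(0, 0), (0, 1), (0, 0), (1, 1), (0, 0)]
N = multiplicity_vector(sample)
print(N)                                          # [2, 0, 1, 0, 0]
print(sum(i * Ni for i, Ni in enumerate(N, 1)))   # 5, the sample size
```

A sample drawn without replacement is exactly one whose multiplicity vector is (s, 0, . . . , 0).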
Proof. ( Lemma ) 

Suppose s < (nm) ^'+1 . Then, as in the proof of Lemma [3l 

fs\ 1 

P (any entry is sampled more than K times) < 



KJ {nm)^^^ 

< 



< 



{K + l)l{nm)K 
(K + l)/e)^+i(nm)^' 



< -, by Stirhng's approximation. 

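We show below that, for any c, 

$$P_{S\sim\mathcal{D}^s_{w/o}}\Big(\sup_{X\in\mathcal{X}} g(L(X)) - \hat{L}_S(X) > c\Big) \le 2K\cdot P_{S\sim\mathcal{D}^s_K}\Big(\sup_{X\in\mathcal{X}} g(L(X)) - \hat{L}_S(X) > (2K)^{-1}c\Big),$$

where $\mathcal{D}^s_{w/o}$ and $\mathcal{D}^s_K$ are defined as in the proof of Lemmas 2 and 3, except with the independent 
noise model. As in the proof of Lemmas 2 and 3, this implies that 

$$P_{S\sim\mathcal{D}^s_{w/o}}\Big(\sup_{X\in\mathcal{X}} g(L(X)) - \hat{L}_S(X) > c\Big) \le 4K\cdot P_{S\sim\mathcal{D}^s}\Big(\sup_{X\in\mathcal{X}} g(L(X)) - \hat{L}_S(X) > (2K)^{-1}c\Big).$$

We now prove that, for any c, 

$$P_{S\sim\mathcal{D}^s_{w/o}}\Big(\sup_{X\in\mathcal{X}} g(L(X)) - \hat{L}_S(X) > c\Big) \le 2K\cdot P_{S\sim\mathcal{D}^s_K}\Big(\sup_{X\in\mathcal{X}} g(L(X)) - \hat{L}_S(X) > (2K)^{-1}c\Big).$$

Write $\Omega = [n]\times[m]$. Consider all samples from $\Omega$, drawn with replacement. When a particular 
$\omega \in \Omega$ is drawn multiple times, then the observed values at that entry of the matrix follow the 
independent noise model as described in the statement of Theorem 3. 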

For a sample set S of size s, for i = 1, . . . , s, define $N(S)$ as in the proof of Lemma 6. Let N be 
any multiplicity vector, of the form $(N_1, \dots, N_r, 0, \dots, 0)$ for some $r \le s$. Let M be a multiplicity 
vector defined from N as follows: 

$$M = (M_i)_i, \quad\text{where } M_i = N_{2i-1} + 2N_{2i} + N_{2i+1}.$$

Now take any Ai,A2, . . . , A2r,B2, ■ ■ ■ , C [n] x [m], all disjoint, with \Ai\ — \Bi\ — Ni for all 
i. Define Bi ^ Ai, and 

2r / i \ 2r / i 

^--E Ea^^M -^b^E E^P^ 
1=1 \j=i J 1=1 \j=i 



Note that N(S'a) = N(S'b) = N. Now define 

2r ( \h\ 



t=i \j=i 



Note that N(Ti) = N(T2) = M, and that up to reordering, Sa + Sb = Ti+ T2. We treat Ti and T2 
as functions of {Sa,Sb)- 

Write ac{S) = I |supjfg;t^ — //^(X) > c|. Then a satisfies the following whenever 

l^il = l^2|: 

I (a2c{Si) + a2c{S2)) < ac(5i + ^2) < a,{Si) + a,{S2) . 
Therefore, 

2-Bs~;7m/(N) (o^ciS)) = {#{Sa,Sb) pairs as above)" 

> (#(S'/i,5b) pairs as above)" 
= (#(S'a,S'b) pairs as above)" 

> {^{Sa, Sb) pairs as above)" 
= {4f{SA, Sb) pairs as above)" 



^ X! ac{SA) + ac{SB) 

{Sa-,Sb) as above 
{Sa-,Sb) as above 

■1 ^ a,(Ti+r2) 

{Sa-,Sb) as above 

E ^ ("2c(Ti) + a2c(r2)) 

{Sa-,Sb) as above 

E "2c(Ti) 



(Sa,5b) as above 

= (#(S'a, 5_b) pairs as above) ""^ E a2c{T) ■ (#(5^, Sb) pairs such that T 

r:N(T)=M 

We also have the following (note that here we treat samples as unordered, unlike in the proofs 
of Lemmas [5] and [3]) : 

(#(S'a, Sb) pairs as above) 



Ni,N2,N2,N3,N3,...,N2r,N2rJ ' 



and for any T with N(T) = M, 

r 

(#(S'a, 5*3) pairs sueh that T = Ti) = J| 

Finally, 

and therefore, continuing from above, 

2i?s~c/ni/(N) (ac(5')) = (#(5'yi, 5b) pairs as above)"^ E (^2c{T) ■ (#(5^, ^s) pairs such that T 

T:N(T)=M 

= (#r : N(T) = M)-i E "2c(r) 

T:N(T)=1VI 

Inducting over r, we see that for any N = (A^i , . . . , Nr,0, . . . ,0, 

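where $K(r)$ is the number of times that the operation $x \mapsto \lceil x/2\rceil$ must be applied iteratively to r 
to obtain 1; note that $2^{K(r)} < 2r$. Therefore, 

$$P_{S\sim\mathcal{D}^s_{w/o}}\Big(\sup_{X\in\mathcal{X}} g(L(X)) - \hat{L}_S(X) > c\Big) \le 2r\cdot P_{S\sim\mathcal{D}^s_r}\Big(\sup_{X\in\mathcal{X}} g(L(X)) - \hat{L}_S(X) > (2r)^{-1}c\Big).$$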

□ 
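
The halving count K(r) in the last step can be checked directly. The sketch below, with the hypothetical function name `halving_steps`, counts how many times $x \mapsto \lceil x/2\rceil$ must be applied to reach 1 and confirms $2^{K(r)} < 2r$ on a range of values:

```python
def halving_steps(r):
    """Number of times x -> ceil(x / 2) is applied to r >= 1 before reaching 1."""
    k = 0
    while r > 1:
        r = (r + 1) // 2  # integer form of ceil(r / 2)
        k += 1
    return k

for r in (1, 2, 3, 5, 8, 9, 1000):
    k = halving_steps(r)
    print(r, k, 2 ** k < 2 * r)
```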

B The Rademacher Complexity of the Trace-Norm Ball 

Srebro and Shraibman (2005) established that for a sample $S = \{(i_1, j_1), \dots, (i_s, j_s)\}$ of s index-pairs, 
the empirical Rademacher complexity of the trace-norm ball, viewed as a predictor of entries, is 
given by: 



7^s ({(i,.?) ^ I X e ]R"X'", \\x\\^ < A}) = 



1 " 



(27) 
where the expectation is over independent uniformly distributed random variables $\xi_1, \dots, \xi_s \in \{\pm 1\}$, 
$\|X\|_2$ is the spectral norm (maximal singular value) of X, and $e_{i,j} = e_i e_j^T$ is a matrix with a single 
1 at location (i, j) and zeros elsewhere. Analyzing the Rademacher complexity then amounts to 
analyzing the expected spectral norm of the random matrix $Q = \sum_{t=1}^{s}\xi_t e_{i_t,j_t}$. 

The worst-case Rademacher complexity, i.e. the supermum of (P7)) over all samples S*, is and 

does not lead to meaningful generalization results. Indeed, if we could meaningfully bound the 
worst-case Rademacher complexity, we could guarantee learning under arbitrary sampling distributions 
over index-pairs, but this is not the case: we know that trace-norm regularization can fail when 
entries are not sampled uniformly (Salakhutdinov and Srebro, 2010). 

Instead, we focus on bounding the expected Rademacher complexity, i.e. the expectation of 
(27) when entries in S are chosen independently from a uniform distribution over index pairs. 

ISrebro and Shraibmanl (j2005[ ) bounded the expected Rademacher complexity by O 



(n+m) log'^/^ n 



using a bound of Seginer (2000) on the spectral norm of a matrix with fixed magnitudes and random 
signs, combined with arguments bounding the number of observations in each row and column. Here 
we present a much simpler analysis, reducing the logarithmic factor from $\log^{3/2}(n)$ to $\log(n)$, using 
a recent result of Tropp (2010). 

We now proceed to bounding $E[\|Q\|_2]$, where the expectation is over the sample S and the 
random signs $\xi_t$. Denote $P_t = \xi_t e_{i_t,j_t}$; we have $Q = \sum_t P_t$, and the $P_t$ are i.i.d. zero-mean random matrices 
(recall that now both $\xi_t$ and $(i_t, j_t)$ are random). Theorem 6.1 of Tropp (2010), combined with 
Remarks 6.3 and 6.5, allows us to bound the expected spectral norm of such a sum of independent 
random matrices by: 

$$E[\|Q\|_2] = O\left(\sigma\sqrt{\log(n+m)} + R\log(n+m)\right), \tag{28}$$

where $\|P_t\|_2 \le R$ (almost surely) and 

$$\sigma^2 = \max\left(\Big\|\sum_t E[P_t^T P_t]\Big\|_2,\ \Big\|\sum_t E[P_t P_t^T]\Big\|_2\right).$$

For each t, $P_t$ is just a matrix with a single +1 or −1, hence $\|P_t\|_2 \le 1$. The matrix $P_t P_t^T \in \mathbb{R}^{n\times n}$ 
is equal to $e_{i,i}$ with probability $\frac{1}{n}$, hence $E[P_t P_t^T] = \frac{1}{n}I_n$ and $\|\sum_t E[P_t P_t^T]\|_2 = \|\frac{s}{n}I_n\|_2 = \frac{s}{n}$. 

Symmetrically, $\|\sum_t E[P_t^T P_t]\|_2 = \frac{s}{m}$ and $\sigma^2 = \max\left(\frac{s}{n}, \frac{s}{m}\right)$. Plugging $\sigma$ and R into (28) we 
have: 



E 



' s{n + m) log(n + m) 



log(n 



= O 



I s{n + m) log(n -|- to) 



nm 



where in the second inequality we assume s > m. Plugging ((29|) into p7|) we get: 
E \n, ({(*, j) ^ X,, I X e E"><™, \\X\\^ < A})] = o 



A {n + m) log(?i + m) 



(29) 



(30) 
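
The random matrix $Q = \sum_t \xi_t e_{i_t,j_t}$ analyzed above is easy to simulate. The following is a minimal sketch, using only the standard library, that draws Q under uniform sampling with replacement and estimates its spectral norm by power iteration. The function names, the starting vector, and the iteration count are illustrative assumptions; power iteration is a stand-in for any SVD routine, not a method used in the paper.

```python
import math
import random

def sign_matrix(n, m, s, rng):
    """Q = sum_t xi_t * e_{i_t, j_t} with uniform index pairs and random signs."""
    Q = [[0.0] * m for _ in range(n)]
    for _ in range(s):
        i, j = rng.randrange(n), rng.randrange(m)
        Q[i][j] += rng.choice((-1.0, 1.0))
    return Q

def spectral_norm(Q, iters=500):
    """Approximate largest singular value via power iteration on Q^T Q."""
    n, m = len(Q), len(Q[0])
    x = [1.0 + 0.01 * j for j in range(m)]  # fixed, non-symmetric start vector
    for _ in range(iters):
        y = [sum(Q[i][j] * x[j] for j in range(m)) for i in range(n)]  # Q x
        z = [sum(Q[i][j] * y[i] for i in range(n)) for j in range(m)]  # Q^T (Q x)
        norm = math.sqrt(sum(v * v for v in z))
        if norm == 0.0:
            return 0.0
        x = [v / norm for v in z]
    y = [sum(Q[i][j] * x[j] for j in range(m)) for i in range(n)]
    return math.sqrt(sum(v * v for v in y))  # ||Q x|| for the unit vector x

rng = random.Random(0)
norms = [spectral_norm(sign_matrix(20, 10, 200, rng), iters=100) for _ in range(20)]
print(sum(norms) / len(norms))  # Monte Carlo estimate of E||Q||_2
```

Averaging over more draws, and varying s, n and m, gives an empirical view of how the expected spectral norm scales.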



C Using Rademacher complexity to bound error 

Let $\mathcal{X}$ be any class of matrices. We first discuss the $\ell_1$-loss case, in which we would like to bound 
$\frac{1}{nm}|Y - \hat{X}(S)|_1$ in expectation over the sample S of size s drawn uniformly at random (with 
replacement) from the matrix. Regarding this as a prediction problem, this is equivalent to bounding 
$E_S[L(\hat{X}(S))]$. We reformulate a result from Bartlett and Mendelson (2001) in the following lemma. 

Lemma 7. Let M be any matrix in $\mathcal{X}$. Excess reconstruction error can be bounded in expectation 
as 

$$E_S\left[L(\hat{X}(S))\right] \le L(M) + 2\mathcal{R}_s(\mathcal{X}),$$

where in this case $\mathcal{R}_s(\mathcal{X})$ denotes the expected Rademacher complexity over a sample of size s. 



Proof. Combining Theorems 8 and 12(4) from Bartlett and Mendelson (2001) (using the last part 
of the proof of Theorem 8 rather than the main statement), since $\ell_1$-error is a 1-Lipschitz function, 

$$E_S\Big[\sup_{X\in\mathcal{X}}\big(L(X) - \hat{L}(X)\big)\Big] \le 2\mathcal{R}_s(\mathcal{X}).$$

In particular, this implies that 

$$E_S\left[L(\hat{X}(S)) - \hat{L}(\hat{X}(S))\right] \le 2\mathcal{R}_s(\mathcal{X}).$$

Furthermore, for any sample S, by definition we know $\hat{L}(\hat{X}(S)) \le \hat{L}(M)$, therefore 

$$E_S\left[L(\hat{X}(S)) - \hat{L}(M)\right] \le E_S\left[L(\hat{X}(S)) - \hat{L}(\hat{X}(S))\right] \le 2\mathcal{R}_s(\mathcal{X}).$$

Finally, note that since M has a fixed value, and does not depend on S, the empirical reconstruction 
error $\hat{L}(M)$ is an unbiased estimator of $L(M)$, that is, 

$$E_S[\hat{L}(M)] = L(M).$$

Therefore, 

$$E_S\left[L(\hat{X}(S))\right] \le L(M) + 2\mathcal{R}_s(\mathcal{X}).$$

□ 
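
For intuition about the quantity $\mathcal{R}_s(\mathcal{X})$ appearing in Lemma 7, the empirical Rademacher complexity of a small finite class can be computed exactly by enumerating all sign vectors. The sketch below assumes the normalization $E_\xi[\sup_h \frac{1}{s}\sum_t \xi_t h_t]$ (no absolute value and no factor of 2), which may differ by constants from the convention in the cited works; the function name `empirical_rademacher` is hypothetical.

```python
from itertools import product

def empirical_rademacher(predictions):
    """Exact empirical Rademacher complexity of a finite class.

    predictions: one list per hypothesis, giving its values on the s
    sample points. Enumerates all 2^s sign vectors, so keep s small.
    """
    s = len(predictions[0])
    total = 0.0
    for signs in product((-1, 1), repeat=s):
        # best correlation with this sign pattern achieved by any hypothesis
        total += max(sum(x * h for x, h in zip(signs, hyp)) for hyp in predictions) / s
    return total / 2 ** s

print(empirical_rademacher([[1, 1]]))            # 0.0: a single hypothesis cannot fit noise
print(empirical_rademacher([[1, 1], [-1, -1]]))  # 0.5
```

A richer class can correlate better with random signs, which is what the complexity term in the lemma penalizes.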



In the $\ell_2$ case, where $L(X) = \frac{1}{nm}|Y - X|_2^2$, Srebro et al. (2010) derive a bound, which holds with 
high probability over S, in the following form (which we write with the notation of matrices, but is 
derived generally for any prediction problem): 



sup L{X) ~ L{X) 



Ai • L{X) < A2 



for some Ai, A2 which depend on the Rademacher complexity of X (and on s and other parameters). 
From this, using arguments similar to those used in Lemma [7] above, they derive the bound on 
L{X{S)) that is shown above in ([T^. 